
Multiple dataset training Web Support #503

Merged: BryonLewis merged 10 commits into master from multiple-dataset-training on Dec 22, 2020
Conversation

@BryonLewis
Collaborator

@BryonLewis BryonLewis commented Dec 16, 2020

Fixes #391

NOTE - Need the latest kitware/viame:gpu-algorithms-latest for the input_list to work properly.

  • Enabled the Training menu button when one or more items are selected in a folder.
  • Changed the input type for training to an array of folderIds and updated the API in the relevant locations.
  • The server now takes the folderIds as JSON data and runs a preprocess check on each dataset to ensure it contains ground-truth CSV files.
  • Updated the training to remove labels.txt and use the new -il input_folder_list.txt and -it input_groundtruth_list.txt to specify the data.
  • Since it still requires a folder structure for orangization I kept the organize_folder_for_training but removed the labels.txt stuff.
  • --no-query is added to the groundtruth command so that it uses all types that would have been in labels.txt by default and does not prompt the user to accept them.
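A minimal sketch of how the two list files could be produced, for illustration only. The helper names `write_training_lists` and `training_args` are hypothetical, and the one-path-per-line layout (with line N of each file describing the same dataset) is an assumption, not something this PR confirms. The sketch also assumes that -il, -it and --no-query all go to the same training invocation; the executable name and any other options are left out.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def write_training_lists(
    datasets: Iterable[Tuple[str, str]], workdir: Path
) -> Tuple[Path, Path]:
    """Write the folder list and ground-truth list in matching order."""
    folder_list = workdir / "input_folder_list.txt"
    truth_list = workdir / "input_groundtruth_list.txt"
    pairs = list(datasets)
    # Assumption: one path per line, same line index = same dataset.
    folder_list.write_text("".join(f"{folder}\n" for folder, _ in pairs))
    truth_list.write_text("".join(f"{truth}\n" for _, truth in pairs))
    return folder_list, truth_list


def training_args(folder_list: Path, truth_list: Path) -> List[str]:
    """Return only the flags named in this PR (executable and other options omitted)."""
    return ["-il", str(folder_list), "-it", str(truth_list), "--no-query"]
```

Keeping the two files in lockstep is the point of writing them from one list of pairs: a dataset can never end up with a folder entry but no ground-truth entry.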

I've tested by taking two small datasets with different track types and training on them. I then ran the trained model on another small dataset and confirmed that it uses types from both datasets.

Additionally I trained across different folders by using the /viame/train endpoint and manually specifying folderIds across different root folders and different public users. It trained successfully and the resulting pipeline incorporated types across the different folders.
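For anyone wanting to repeat the manual test, here is a hedged sketch of building such a request. It only constructs the request and does not send it. The function name `build_train_request` is hypothetical, the bare JSON array body is an assumption about the request shape, the `Girder-Token` header assumes standard Girder token authentication, and any other parameters the endpoint expects are omitted.

```python
import json
from typing import List
from urllib.request import Request


def build_train_request(api_root: str, token: str, folder_ids: List[str]) -> Request:
    """Build (but do not send) a POST to /viame/train carrying the folderIds."""
    # Assumption: the folderIds travel as a JSON array in the request body.
    body = json.dumps(folder_ids).encode("utf-8")
    return Request(
        url=f"{api_root.rstrip('/')}/viame/train",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: standard Girder token authentication.
            "Girder-Token": token,
        },
    )
```

The ids can come from different root folders or different users, since nothing in the request ties them to a common parent.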

@BryonLewis BryonLewis changed the base branch from client/training-ui to master December 16, 2020 16:13
@BryonLewis BryonLewis linked an issue Dec 16, 2020 that may be closed by this pull request
@BryonLewis BryonLewis force-pushed the multiple-dataset-training branch from a5c8bef to 7e1f6a3 Compare December 16, 2020 18:51
@BryonLewis BryonLewis force-pushed the multiple-dataset-training branch from 0a6567d to 5b17e1a Compare December 21, 2020 17:30
@BryonLewis BryonLewis marked this pull request as ready for review December 21, 2020 17:41
@subdavis
Contributor

> Since it still requires a folder structure for orangization

Could you explain this part? I didn't expect that anything would need to move. You could just do a simple download of each dataset from Girder, then point to the data in-place without moving anything.

Also, I expect you'd like to merge this before #487. I'm fine with that. Just to confirm, this will work with arbitrary dataset ids right? They don't have to be siblings?

@BryonLewis
Collaborator Author

> Could you explain this part? I didn't expect that anything would need to move. You could just do a simple download of each dataset from Girder, then point to the data in-place without moving anything.

Besides my massive spelling mistake there (orangization), that was a bad choice of words for the explanation. I had assumed that the test for whether ground_truth is a directory was in there because of some legacy items where the meta.detection might provide a directory instead of a csv, so that test needed to remain. Really all organize_folder_for_training does right now is check if the ground_truth is a directory, if it is it will copy the first .csv out of it into the training data directory associated with that groundtruth and then deletes the folder. If it is a file it just renames it to ground_truth.csv which is unnecessary but cleaner. I could keep all ground truth at the root level, but if I'm doing the test and possibly moving it, why not keep it a bit more organized? I should probably clean up the description and what it is called.
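The behaviour described above could be sketched roughly as follows. This is not the actual organize_folder_for_training code: the name `normalize_ground_truth` is hypothetical, and sorting the CSV files to decide which one counts as "first" is an assumption made so the result is deterministic.

```python
import shutil
from pathlib import Path


def normalize_ground_truth(ground_truth: Path, training_dir: Path) -> Path:
    """Leave a single ground_truth.csv in training_dir and return its path."""
    target = training_dir / "ground_truth.csv"
    if ground_truth.is_dir():
        # Directory case: copy the first CSV out, then delete the folder.
        csv_files = sorted(ground_truth.glob("*.csv"))
        if not csv_files:
            raise FileNotFoundError(f"no .csv file found in {ground_truth}")
        shutil.copy(csv_files[0], target)
        shutil.rmtree(ground_truth)
    else:
        # File case: only the name changes.
        shutil.move(str(ground_truth), str(target))
    return target
```

Either way, the training step can then rely on one fixed file name per dataset directory.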

> Also, I expect you'd like to merge this before #487. I'm fine with that. Just to confirm, this will work with arbitrary dataset ids right? They don't have to be siblings?

Yeah, that was the second part of my testing. I was training across different users' public folders; it just required that I manually create the array of dataset ids and call the endpoint, because there is no UI for it currently.

Member

@jjnesbitt jjnesbitt left a comment

Just some minor things but it looks good, haven't tested locally yet

4 comment threads on server/viame_tasks/tasks.py (outdated)
@jjnesbitt
Member

> Really all organize_folder_for_training does right now is check if the ground_truth is a directory, if it is it will copy the first .csv out of it into the training data directory associated with that groundtruth and then deletes the folder. If it is a file it just renames it to ground_truth.csv which is unnecessary but cleaner.

For the record, the reason this is done is in case the girder item has multiple files. If it does, it's a folder when downloaded. Otherwise it's just a file. Since we still use the csv_detection_file method to ensure a csv file on the item, this is likely always the case.

BryonLewis and others added 4 commits December 21, 2020 19:05
Co-authored-by: Jacob Nesbitt <jjnesbitt2@gmail.com>
Co-authored-by: Jacob Nesbitt <jjnesbitt2@gmail.com>
Co-authored-by: Jacob Nesbitt <jjnesbitt2@gmail.com>
Contributor

@subdavis subdavis left a comment

👍

detections = list(
    Item().find({"meta.detection": str(folderId)}).sort([("created", -1)])
)
detection = detections[0] if detections else None
Contributor

Maybe refactor viame_detection.py _load_detections() helper function?

Contributor

Eh, this can be done later.
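If that refactor happens later, one possible shape is sketched below using plain dictionaries in place of Girder items, so it runs without a database. The names `latest_detection` and `check_datasets`, the `find_items` callback, and the choice of `ValueError` are all hypothetical. The sketch only mirrors the logic quoted above (newest item first, or None) and the per-dataset preprocess check from the PR description.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Detection = Dict[str, Any]


def latest_detection(items: Iterable[Detection]) -> Optional[Detection]:
    """Return the most recently created item, or None when there are none."""
    ordered = sorted(items, key=lambda item: item["created"], reverse=True)
    return ordered[0] if ordered else None


def check_datasets(
    folder_ids: Iterable[str],
    find_items: Callable[[str], Iterable[Detection]],
) -> List[Detection]:
    """Collect the newest detection per folder, failing if any folder has none."""
    found: List[Detection] = []
    missing: List[str] = []
    for folder_id in folder_ids:
        detection = latest_detection(find_items(folder_id))
        if detection is None:
            missing.append(folder_id)
        else:
            found.append(detection)
    if missing:
        # Report every dataset lacking ground truth, not just the first one.
        raise ValueError(f"no ground truth found for: {', '.join(missing)}")
    return found
```

Passing the lookup in as a callback keeps the check independent of how detections are stored, which is what would let both the training task and the detection endpoint share it.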

@BryonLewis BryonLewis merged commit 20826a0 into master Dec 22, 2020
@subdavis subdavis deleted the multiple-dataset-training branch December 22, 2020 19:26

Development

Successfully merging this pull request may close these issues.

[FEATURE] Allow training of multiple datasets at once

3 participants